Abstract

With the ever increasing volume of information being generated, users of information systems are
facing an information overload. It is desirable to support information filtering as a complement
to traditional retrieval mechanisms. The number of users, and thus profiles (representing users'
long-term interests), handled by an information filtering system is potentially huge, and the
system has to process a constant stream of incoming information in a timely fashion. The
efficiency of the filtering process is thus an important issue.

In this paper, we study what data structures and algorithms can be used to efficiently perform
large-scale information filtering under the vector space model, a retrieval model established as
being effective. We apply the idea of the standard inverted index to index user profiles. We
devise an alternative to the standard inverted index in which, instead of indexing every term
in a profile, we select only the significant ones to index. We evaluate the performance of both
and show that the indexing methods require orders of magnitude fewer I/Os to process a document
than when no index is used. We also show that the proposed alternative performs better in terms
of I/O and CPU processing time in many cases.


1  Introduction 


Information is increasingly available in electronic form. The number and size of full text document
databases are rapidly increasing. Users of such database systems are facing an information
overload; it is becoming difficult for users to rely solely on traditional retrospective search and
retrieval mechanisms to keep themselves apprised of new documents that are relevant to their
interests. As a complement to conventional search mechanisms, information systems can provide an
information filtering mechanism, through which a user subscribes profiles, or queries that are
continuously evaluated, to represent his long-term interests, and then passively receives
information filtered by the system according to the profiles.

Figure 1: Information Filtering Server(s)

Research in information filtering has received a lot of attention lately. However, previous work
has focused on the effectiveness (precision and recall) of the filtering, and little has been done to
address the efficiency (performance) aspect of the problem. We believe that information filtering is
going to be used on a large scale and hence the efficiency issue must be addressed. In this paper,
we present data structures and algorithms to support information filtering.

Wide area information retrieval is now a reality; large-scale world-wide information filtering is
also foreseeable. Consider a population of users and a number of information sources in a networked
information filtering environment. The filtering can be done either at the information sources, at
the user sites, or at an intermediate information filtering server (Figure 1). Relying solely on user
filtering is expensive since network bandwidth is wasted to transmit irrelevant information and a lot
of wasteful local processing is done. Relying on filtering at the sources themselves is also expensive
since users need to replicate their profiles at all possible sources. The information filtering server is
a good compromise. It collects information from a set of sources and routes it to interested users. Of
course, there can be multiple information filtering servers on the network, each servicing a different
set (maybe overlapping) of users and information sources.

In this paper, we focus on one information filtering server and consider what data structures
and algorithms it can employ to speed up the filtering process. This is important because, firstly,
the number of users and profiles a server has to handle is potentially huge. Secondly, as the rate
of information generation is high, a filtering server will have to process a large number of new
documents every day, especially if the server collects information from a number of sources. Thirdly,
it is important to deliver relevant information to users in a timely fashion for such a service to
be useful. In summary, information filtering servers will have to handle a huge number of profiles
and process a constant stream of incoming documents in a timely fashion. Thus, developing efficient
processing methods for a single filtering server can be seen as a first but important step in achieving
efficient filtering on a global scale.

To  further  motivate  the  need  for  efficient  information  filtering  methods,  let  us  look  at  a  popular 
information  source  today  -  Netnews.  The  study  [11]  reports  that,  as  of  January  1993,  the  total 
Netnews  readership  worldwide  is  estimated  to  be  1.9  million.  The  estimates  for  the  average  traffic 
are  49.5  MB  and  19,210  messages  per  day  (counting  cross-posted  messages  only  once).  If  we  consider 
a  Netnews  filtering  server  that  serves  a  small  fraction  (say  5%)  of  this  user  population,  and  each 
user  has  say  five  profiles,  the  server  will  have  to  handle  hundreds  of  thousands  of  profiles.  To  match 
this  large  number  of  profiles  against  a  daily  influx  of  tens  of  thousands  of  documents  in  a  timely 
fashion,  it  is  apparent  that  efficient  data  structures  and  algorithms  are  needed.  Furthermore,  keep 
in  mind  that  these  Netnews  numbers  are  for  a  single  information  source  today.  In  the  future,  one 
would  expect  many  more  sources  with  even  higher  volumes. 

Netnews  does  support  a  rudimentary  filtering  mechanism  by  categorizing  articles  into  newsgroups 
and  allowing  users  to  subscribe  to  newsgroups  of  interest.  However,  a  finer  granularity  of  information 
need  matching,  by  means  of  information  retrieval  techniques,  will  cater  much  better  to  individual 
interests.  Research  in  information  retrieval  has  given  rise  to  many  retrieval  models,  notably  the 
boolean model, the vector space model, and the probabilistic model, that are applicable to
information filtering [1]. Reference [18] presents data structures and algorithms for information filtering
under  the  boolean  model.  In  this  paper,  we  consider  the  vector  space  model  (VSM),  which  is  widely 
recognized  as  an  effective  retrieval  model.  It  uses  a  natural  language  interface,  which  makes  it  easy 
to  use.  A  well-known  technique,  called  relevance  feedback,  provides  an  easy  way  to  improve  retrieval 
effectiveness.  Some  of  the  ideas  in  the  VSM  have  been  implemented  in  the  WAIS  system  [8].  The 
popularity  of  WAIS  demonstrates  the  appeal  of  the  VSM.  Our  methods  are  thus  for  documents  and 
profiles  represented  in  the  VSM. 

Our  algorithms  make  use  of  an  inverted  index  to  speed  up  the  filtering  process.  Inverted  indexes 
have  been  used  by  information  retrieval  systems  to  facilitate  traditional  retrospective  search,  namely 
by  building  an  index  of  documents.  In  this  paper,  we  investigate  how  the  idea  of  an  inverted  index 
can  be  used  to  speed  up  profile  processing.  Specifically,  we  propose  to  use  an  inverted  index  of 
profiles.¹ In the information retrieval scenario, a user query is matched against a document index.
Here, an incoming document is matched against a profile index. We investigate what modifications
need to be made, and what alternatives are feasible.

1 Other retrieval methods (e.g., signature files [4]) can also be used to speed up filtering (e.g., building a signature
file of profiles). In this paper, we focus on inversion-based methods. Further work would need to be done to compare
the performance of signature-based and inversion-based methods for information filtering.

Incidentally,  we  have  implemented  two  experimental  filtering  servers  at  Stanford  to  disseminate 
Netnews  articles  and  computer  science  technical  reports.  The  reader  is  encouraged  to  try  out  these 
services.  For  instructions  on  how  to  use  these  services,  send  an  electronic  mail  message  to  either 
elib@db.stanford.edu (for technical reports) or netnews@db.stanford.edu (for Netnews) with
the  word  “help”  in  the  message  body.  Instructions  will  be  returned  automatically.  The  current 
version  of  these  servers  is  not  efficient  (it  uses  the  Brute  Force  method  described  later  on).  However, 
as  more  users  subscribe  to  our  servers,  there  is  an  obvious  need  for  an  efficient  implementation,  and 
this  motivated  the  work  reported  in  this  paper. 

The rest of the paper is organized as follows. In Section 2, we give a brief summary of the VSM,
as applied to information filtering. In Section 3, we present three methods to process profiles. Details
of the analysis and simulations used to evaluate the performance of the methods are described in
Section 4. The results of the evaluation are presented in Section 5. Section 6 is a survey of related
work and Section 7 concludes the paper.

2  VSM  Applied  to  Information  Filtering 

In  this  section,  we  give  a  brief  summary  of  the  VSM  as  used  in  information  filtering.  The  purpose  of 
this  is  to  explain  some  terminology  and  assumptions  necessary  for  the  exposition  of  our  algorithms  in 
Section  3.  For  an  in-depth  introduction  to  the  VSM  and  information  filtering  the  reader  is  referred 
to  [12]  and  [1]  respectively. 

2.1  Document  and  Profile  Vector 

In the VSM, we identify a document by a set of terms. Weights are assigned to terms as statistical
importance indications. If m distinct terms are available for content identification, a document D
can be conceptually represented as an m-dimensional vector, $D = (w_1, \ldots, w_m)$, where $w_i$ is the
weight assigned to the i-th term and is 0 for terms not present in D. To compute the vector
representation of a document, the following steps are usually followed. First the individual words
occurring in the document are identified. Words that belong to the stop list, which is a list of
high-frequency words with low content discriminating power, are deleted. Then a stemming routine is
used to reduce each remaining word to word-stem form. For each remaining word stem (a term), a weight
is assigned in an attempt to represent how "important" that term is. One common way to compute the
weight of a term is to multiply the term frequency (tf) factor with the inverse document frequency
(idf) factor. The tf factor is proportional to the frequency of the term within the document. The
idf factor corresponds to the content discriminating power of the term: a term that appears rarely in
documents (e.g., "queue") has a high idf, while a term that occurs in a large number of documents
(e.g., "system") has a low idf.² (See Section 4.1.1 for examples of formulas used to calculate these
factors.)

As profiles in the VSM are expressed in natural language, we can represent profiles just like
documents. A profile P appears as $P = (u_1, \ldots, u_m)$. Sometimes we follow the convention of
writing a document or profile vector as a vector of (term, weight) pairs; those terms not listed
have weights equal to 0. Thus, a profile P with p non-zero weighted terms can be written as
$P = ((v_1, u_1), \ldots, (v_p, u_p))$. For instance, in the profile P = (("queue", 0.93), ("system", 0.37)), term
"queue" has a weight of 0.93, "system" has 0.37, and all other terms have a zero weight. The weights
again describe the "importance" of each term.
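The vectorization steps described in this subsection can be sketched as follows. This is only a minimal illustration under stated assumptions: the stop list, the toy stemmer, the use of a raw term count as the tf factor, and the idf values are all hypothetical stand-ins and are not the formulas used in this paper (those are referred to Section 4.1.1).

```python
import math
from collections import Counter

# Hypothetical stop list, for illustration only.
STOP_LIST = {"the", "a", "of", "in", "is", "and", "to"}

def stem(word):
    # Toy stemmer (assumption): strips one trailing "s" from longer words.
    # A real system would use a proper stemming algorithm instead.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def vectorize(text, idf):
    """Return a unit-length {term: weight} vector for `text`.

    `idf` maps a term to an assumed, precomputed idf value; terms that
    are missing from `idf` receive weight 0 and are dropped.
    """
    words = [w for w in text.lower().split() if w.isalpha()]
    terms = [stem(w) for w in words if w not in STOP_LIST]
    tf = Counter(terms)                      # raw count as the tf factor
    raw = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    raw = {t: w for t, w in raw.items() if w > 0}
    length = math.sqrt(sum(w * w for w in raw.values()))
    if length == 0:
        return {}
    return {t: w / length for t, w in raw.items()}

# Made-up idf values: the rare term gets the higher idf.
idf = {"queue": 2.5, "system": 1.0}
profile = vectorize("queues in the system", idf)
print({t: round(w, 2) for t, w in profile.items()})
```

With these made-up idf values the resulting vector rounds to the (("queue", 0.93), ("system", 0.37)) profile used as the example above, and its length is 1.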


2.2  Similarity  Measure 

We can measure the degree of similarity between a document-profile pair based on the weights of
the corresponding matching terms. The cosine measure has been used for this purpose; given a
document $D = (w_1, \ldots, w_m)$ and a profile $P = (u_1, \ldots, u_m)$, the cosine similarity measure is:

$$\mathrm{sim}(D, P) = \frac{D \cdot P}{\|D\| \, \|P\|}.$$

In this paper we assume that the document and profile vectors are normalized by their lengths; thus
the above simplifies to:

$$\mathrm{sim}(D, P) = D \cdot P = \sum_{i=1}^{m} w_i u_i.$$
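As a small illustration of the two forms of the measure, the sketch below stores vectors sparsely as term-to-weight dictionaries (a representation chosen here for convenience, not prescribed by the paper) and checks that the dot product of length-normalized vectors equals the general cosine formula. The function names and the sample vectors are hypothetical.

```python
import math

def dot(d, p):
    # Sum w_i * u_i over the terms that the two sparse vectors share.
    if len(p) > len(d):
        d, p = p, d          # iterate over the shorter vector
    return sum(u * d.get(t, 0.0) for t, u in p.items())

def norm(v):
    return math.sqrt(sum(w * w for w in v.values()))

def cosine(d, p):
    # General form: D . P / (|D| |P|); defined as 0 for an empty vector.
    denom = norm(d) * norm(p)
    return dot(d, p) / denom if denom else 0.0

def normalize(v):
    n = norm(v)
    return {t: w / n for t, w in v.items()} if n else {}

doc = {"queue": 3.0, "disk": 4.0}   # unnormalized sample document
prof = {"queue": 1.0}               # sample profile

general = cosine(doc, prof)
simplified = dot(normalize(doc), normalize(prof))
print(general, simplified)
```

Both values are 0.6 (up to floating-point rounding), which is why the methods in Section 3 only need to accumulate products of weights once the vectors are normalized.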

2.3  Relevance  Threshold 

In an information retrieval setting, a query is run against a database of documents, and the relevant
documents are returned to the user, ranked by their scores, i.e., the similarity between the query
and the documents. In an information filtering setting, a profile is compared with a single document
or a small number of documents. It is undesirable to filter documents based on their rank among a
small batch of documents. In [5], a fixed number of top-ranked documents is returned over a certain
period of time. This is only possible if the period is long enough to allow a significant number
of documents to be collected to make the ranking meaningful; and in doing so, the timeliness of
the documents is sacrificed. Also, the filtering effectiveness (precision and recall) depends on the
particular set of documents received during a period. If all documents are relevant, then some will be
missed (low recall). If few documents are relevant, then some documents delivered will be irrelevant
(low precision). Reference [5] indeed reports such drawbacks.

2 ... a pre-existing reference corpus of text, as is done in [5].

As is suggested in [5], an alternative is to allow the user to specify some kind of absolute relevance
threshold - documents above the threshold are considered relevant, and those below are not. With
this strategy, instantaneous processing of documents is possible (i.e., a document can be processed
one at a time, as soon as it is received). Also, the precision and recall of the filtering are independent
of when it is performed. Interestingly, such relevance thresholds can also be used in conventional
information retrieval; [13] describes such an experiment. We sum up this discussion with the following
definition.

Definition 1: Given a profile P and a relevance threshold $\theta$, a document D is relevant to P if
$\mathrm{sim}(D, P) > \theta$. □


2.4  Relevance  Feedback 

A well-known technique used to improve the effectiveness of retrieval is relevance feedback. This
technique can be applied in information filtering as well. In essence, a profile vector can be
automatically reformulated by adding to it relevant document vectors (as judged by the user) and
subtracting from it irrelevant document vectors. A variety of adjustment formulas have been
studied; for example, one variety, called Ide Regular [14], can be applied to information filtering as

$$P^{(i+1)} = P^{(i)} + \sum_{D\ \text{relevant}} D \; - \sum_{D\ \text{irrelevant}} D$$

where $P^{(i)}$ is the profile vector after the i-th feedback iteration. In this paper, we are not concerned
with which exact adjustment formula is used. Our methods do not depend on which formula is used
(or if relevance feedback is used at all). In one of our simulation experiments, we investigate the
impact on the performance of our profile processing methods when relevance feedback is used.
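The additive adjustment described above (add the relevant document vectors, subtract the irrelevant ones) can be sketched as below. This is an illustrative sketch only: the sparse dictionary representation, the function name `feedback`, and the sample vectors are assumptions, and the sketch deliberately does not clip negative weights or renormalize the result, since the text does not specify either step.

```python
def feedback(profile, relevant, irrelevant):
    """Return the profile after one feedback iteration.

    profile:    {term: weight} vector before the iteration
    relevant:   list of document vectors judged relevant by the user
    irrelevant: list of document vectors judged irrelevant by the user
    """
    new_profile = dict(profile)
    for doc in relevant:
        for term, w in doc.items():
            new_profile[term] = new_profile.get(term, 0.0) + w
    for doc in irrelevant:
        for term, w in doc.items():
            new_profile[term] = new_profile.get(term, 0.0) - w
    return new_profile

old = {"queue": 0.93, "system": 0.37}
updated = feedback(
    old,
    relevant=[{"queue": 0.6, "disk": 0.8}],
    irrelevant=[{"system": 1.0}],
)
print(updated)
```

Note that feedback can introduce new terms into a profile ("disk" here) and change existing weights, so any index built over the profiles would need to be updated afterwards.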


3  Data  Structures  and  Algorithms 

In this section we describe three methods that match a document against a number of profiles and
determine the profiles to which the document is relevant. We assume that documents are processed
one at a time, as soon as they arrive. Our methods can easily be extended to handle the case when a
number of documents are batched together for processing, but we do not address this here.

In  two  of  the  methods,  we  make  use  of  an  inverted  index.  In  an  index,  for  each  term  x,  we 
collect  profiles  that  contain  it  to  form  an  inverted  list.  The  mapping  from  terms  to  the  location 
of  their  inverted  lists  on  disk  is  implemented  as  a  hash  table,  called  the  directory.  We  assume  that 
the  inverted  lists  are  stored  on  disk  while  the  directory  fits  in  main  memory. 

Our  focus  in  this  paper  is  on  efficient  VSM  filtering  algorithms.  The  issue  of  how  to  efficiently 
update  profiles  in  the  data  structures  is  not  addressed.  We  assume  that  such  updates  are  batched  and 
are  periodically  installed.  However,  in  the  evaluation  of  our  indexing  methods,  we  do  consider  two 
options of storing inverted lists on disk. One option is to pack all the lists into contiguous blocks, and
the other is to store each list individually in an integral number of blocks. While handling updates
in the first option requires reading and writing all the lists, it is much easier in the second option.
On the other hand, the storage space requirement for the second option is higher, since the last block of each list is generally only partly filled. In our evaluation
we  examine  the  trade-off  involved. 

3.1  Brute  Force  (BF)  Method 

If we store profiles sequentially on disk without any index structures, then all profiles must be
evaluated when a new document is received. This is the Brute Force (BF) method.

When a document arrives, we first compute its vector representation as described in Section 2.
Then we examine each profile in turn. For each (term, weight) pair (x, u) in a profile, we find x's
weight w in the document vector, and calculate the product w × u. The sum of such products is
the cosine similarity measure. The document is relevant to a profile if the cosine measure is greater
than the relevance threshold associated with the profile.

We store a profile on disk as a variable-length record with these fields: the profile identifier, the
length (i.e., the number of terms in the profile), the (term, weight) pairs, and finally the relevance
threshold.
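A minimal in-memory sketch of the BF method follows. It is only an illustration: the profile records are kept in a Python list rather than read sequentially from disk, the explicit length field is omitted because a Python list knows its own length, and the identifiers and sample data are hypothetical.

```python
def brute_force(document, profiles):
    """Return the identifiers of all profiles the document is relevant to.

    document: {term: weight} vector, assumed normalized
    profiles: list of records (profile_id, [(term, weight), ...], threshold)
    """
    matches = []
    for profile_id, pairs, threshold in profiles:
        # Sum of w * u over the profile's (x, u) pairs; w is 0 when the
        # document does not contain x.
        score = sum(u * document.get(x, 0.0) for x, u in pairs)
        if score > threshold:
            matches.append(profile_id)
    return matches

profiles = [
    ("P", [("queue", 0.93), ("system", 0.37)], 0.5),
    ("Q", [("system", 1.0)], 0.2),
]
document = {"queue": 0.6, "disk": 0.8}
print(brute_force(document, profiles))
```

Every profile is scanned for every document, even profile Q, which shares no term with the document. Avoiding that wasted work is the motivation for the indexing methods described next.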

3.2  Profile  Indexing  (PI)  Method 

To reduce the number of profiles that must be examined, we build an inverted index of profiles. We
call this the Profile Indexing (PI) method. For each term x, we collect all the profiles that contain
it to form its inverted list.³ The list is made up of postings; each contains the identifier of a profile
involving x and the weight of x in it. Thus, a profile with p terms will be found in p postings, each
posting in a different list. When processing a document D, we only need to examine those profiles
in the inverted lists of the terms that are in D.

3 As detailed later, we may collect all or some of the profiles that contain a term to form its inverted list.

To  match  a  document  against  these  profiles,  we  need  two  (main  memory)  arrays,  THRESHOLD 
and  SCORE.  (This  method  and  the  next  use  more  main  memory  than  the  BF  method.)  The  number 
of  entries  in  each  array  is  equal  to  the  number  of  profiles  the  system  handles.  Each  profile  has  an 
entry  in  each  array:  the  THRESHOLD  entry  stores  the  relevance  threshold,  and  the  SCORE  entry 
is  used  to  keep  the  score  of  the  profile. 

When a document D arrives, we initialize the SCORE array to all 0's. For each term x with
weight w in the document, we use the directory to retrieve x's inverted list. Then we process each
profile P in the list. That is, if the weight of x in P is u, we increment SCORE[P] by the product
w × u. After all document terms are processed, a profile whose SCORE entry is greater than the
THRESHOLD entry matches the document.

To  illustrate,  consider  three  profiles: 


$P_1$ = ((a, 0.46), (b, 0.14), (c, 0.17), (d, 0.62), (e, 0.59)),   $\theta_1$ = 0.25

$P_2$ = ((a, 0.95), (b, 0.30)),   $\theta_2$ = 0.20

$P_3$ = ((c, 0.14), (e, 0.49), (f, 0.17), (g, 0.42), (h, 0.11), (i, 0.10), (j, 0.72)),   $\theta_3$ = 0.25

The inverted index for these profiles is shown in the right-hand side of Figure 2. For example,
the a list contains the postings for $P_1$ and $P_2$. The 0.46 value in the first entry in this list is the
weight of a in $P_1$. Now suppose this document arrives:

D = ((b, 0.15), (d, 0.32), (f, 0.21), (h, 0.14), (j, 0.90)).

To process this document, first we read the b list, and increment the SCORE entries of $P_1$ and $P_2$
by 0.15 × 0.14 = 0.021 and 0.15 × 0.30 = 0.045 respectively. The lists of d, f, h, and j are processed
similarly. The final values of the SCORE array are as shown in the figure. This document is relevant
to $P_3$.

Notice  the  PI  method  is  almost  symmetrical  to  the  method  used  in  information  retrieval  to  match 
a  query  against  a  database  of  documents  with  an  index  of  documents,  with  the  roles  of  documents 
and  queries  (profiles)  reversed.  The  difference  is  that  the  THRESHOLD  array  is  not  used;  instead, 
after  the  computation  of  similarities,  the  SCORE  array  is  sorted  to  find  the  rank  of  the  documents. 
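The PI method and the running example above can be sketched as follows. This is a hedged illustration rather than the implementation evaluated in this paper: the inverted lists and the directory are collapsed into one in-memory Python dictionary instead of disk-resident lists with a hashed directory, and the THRESHOLD and SCORE arrays are dictionaries keyed by profile identifier. All function names are hypothetical.

```python
from collections import defaultdict

def build_index(profiles):
    """Build an inverted index of profiles.

    profiles: {profile_id: ([(term, weight), ...], threshold)}
    Returns (index, threshold), where index maps each term to its
    inverted list of (profile_id, weight) postings.
    """
    index = defaultdict(list)
    threshold = {}
    for pid, (pairs, theta) in profiles.items():
        threshold[pid] = theta
        for x, u in pairs:
            index[x].append((pid, u))
    return index, threshold

def match(document, index, threshold):
    """Score a document, given as (term, weight) pairs, against all profiles."""
    score = dict.fromkeys(threshold, 0.0)      # SCORE initialized to all 0's
    for x, w in document:
        for pid, u in index.get(x, ()):        # only the lists of terms in D
            score[pid] += w * u
    matches = [pid for pid in threshold if score[pid] > threshold[pid]]
    return score, matches

profiles = {
    "P1": ([("a", 0.46), ("b", 0.14), ("c", 0.17), ("d", 0.62), ("e", 0.59)], 0.25),
    "P2": ([("a", 0.95), ("b", 0.30)], 0.20),
    "P3": ([("c", 0.14), ("e", 0.49), ("f", 0.17), ("g", 0.42),
            ("h", 0.11), ("i", 0.10), ("j", 0.72)], 0.25),
}
D = [("b", 0.15), ("d", 0.32), ("f", 0.21), ("h", 0.14), ("j", 0.90)]

index, threshold = build_index(profiles)
score, matches = match(D, index, threshold)
print({pid: round(s, 4) for pid, s in score.items()}, matches)
```

For the running example the scores come out as approximately 0.2194 for P1, 0.045 for P2, and 0.6991 for P3, so only P3 exceeds its threshold, in agreement with the text.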
Figure 2: Data Structures for Profile Indexing

3.3 Selective Profile Indexing (SPI) Method

In  the  PI  method,  we  index  a  profile  by  all  its  terms.  In  this  subsection  we  investigate  an  alternative 
in  which  we  only  select  a  number  of  terms  for  indexing. 

Consider the term b in $P_1$ in our running example. Suppose a document arrives and it does
not contain the terms a, c, d, or e. The maximum score $P_1$ could have against this document is
0.14 (if b's weight in the document is the highest possible, 1.0), which is less than the threshold
specified. At a threshold of 0.25, the term b is insignificant in that it alone cannot produce enough
score for a document to be relevant. Thus, we may choose not to index the profile with the term
b - a document that contains only b and no other terms in the profile will not be relevant anyway.
However, a document that contains b and another term in the profile may be relevant; so we need
to duplicate (b, 0.14) in the postings of the other terms in their respective lists. (Since we assume
that the inverted lists are stored on disk, it is better to duplicate the pair than to store it elsewhere
and keep a pointer in the postings to reference it, as extra I/Os would be needed to look it up. If the
entire index fits in main memory, it is better to use the pointer option. See comments in Section 7.)

Similarly, consider the subvector ((h, 0.11), (i, 0.10)) in $P_3$. Suppose a document arrives that
does not have the other terms in $P_3$. Then an upper bound to the similarity between $P_3$ and this
document is 0.11 + 0.10 = 0.21 (we can actually find a tighter upper bound, by a theorem proved
below). Again, with a threshold of 0.25, the subvector is insignificant. In this case, we may choose
not to post the profile in the inverted lists of h and i and duplicate the pairs in the postings of the
other terms in the profile. These observations lead us to this definition.

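Definition 2: Given a profile vector $P = ((v_1, u_1), \ldots, (v_p, u_p))$, a subvector
$P_s = ((v_{i_1}, u_{i_1}), \ldots, (v_{i_k}, u_{i_k}))$, $1 \le i_1 < \cdots < i_k \le p$, is insignificant at a threshold of $\theta$ if for any
document D, $\mathrm{sim}(D, P_s) \le \theta$. □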

Given a profile like $P_3$, there may be several insignificant subvectors (e.g., ((h, 0.11), (i, 0.10)) is
one, ((c, 0.14), (i, 0.10)) is another). Which subvector should we use to reduce the number of index
postings? One idea is to use the subvector that contains the most low-idf terms. Low-idf terms
occur more frequently in documents; thus, by not posting these terms we expect to save the most
lookup work.

Definition 3: Given a profile vector $P = ((v_1, u_1), \ldots, (v_p, u_p))$, a subvector
$P_s = ((v_{i_1}, u_{i_1}), \ldots, (v_{i_k}, u_{i_k}))$, $1 \le i_1 < \cdots < i_k \le p$, is the most insignificant at a threshold of $\theta$
if it has the largest number of lowest idf terms among the insignificant subvectors at a threshold
of $\theta$. □

Assuming idfs are distinct, a profile vector has a unique most insignificant subvector at a given
threshold. We need a way of checking whether a subvector is the most insignificant subvector and
this requires the ability to compute the maximum possible similarity between a profile subvector
and any document vector. Intuitively, we can see that the similarity between a profile subvector
and any unit document vector is highest when the document vector is "in the same direction" as
the profile subvector. And if that happens, the similarity is given by the magnitude of the profile
subvector. This is formally stated and proved as follows.

Theorem 1: For any P and any D with $\|D\| \le 1$, $\mathrm{sim}(D, P) \le \|P\|$.

Proof: This follows easily from the Cauchy-Schwarz Inequality [6]:

$$\mathrm{sim}(D, P) = D \cdot P \le |D \cdot P| \le \|D\| \, \|P\| \le \|P\|. \qquad \blacksquare$$

To find the most insignificant subvector of a profile vector, we can sort the terms in increasing
order of idf and include as many of the lowest-idf terms as possible while the magnitude of the
subvector stays within the threshold. For example, consider $P_3$ again. We assume that the term
weights are directly proportional to the idfs (which is true if the tf components are the same). As

$\|((c, 0.14), (h, 0.11), (i, 0.10))\| = 0.2042 < 0.25$, and

$\|((f, 0.17), (c, 0.14), (h, 0.11), (i, 0.10))\| = 0.2657 > 0.25$,

((c, 0.14), (h, 0.11), (i, 0.10)) is the most insignificant subvector of $P_3$ at a threshold of 0.25. This also
shows that Theorem 1 is stronger than the naive way of finding an upper bound by simply adding
the weights, as we have done earlier: the naive bound for this subvector, 0.14 + 0.11 + 0.10 = 0.35,
exceeds the threshold, whereas its magnitude does not.

Figure 3: Data Structures for the SPI Method
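The greedy selection of the most insignificant subvector described above can be sketched as follows. The sketch rests on two assumptions that should be read as such: it uses the term weight as a stand-in for the idf ordering (the same simplification made in the example above) unless an explicit `idf` mapping is supplied, and it treats a subvector as insignificant when its magnitude does not exceed the threshold, which follows from Theorem 1 together with the strict inequality in Definition 1. The function name is hypothetical.

```python
import math

def most_insignificant(pairs, theta, idf=None):
    """Return the most insignificant subvector of a profile at threshold theta.

    pairs: [(term, weight), ...] of the profile
    idf:   optional {term: idf}; if omitted, weights are assumed to be
           directly proportional to idfs and are used for the ordering.
    """
    if idf is not None:
        ordered = sorted(pairs, key=lambda pair: idf[pair[0]])
    else:
        ordered = sorted(pairs, key=lambda pair: pair[1])
    chosen, squared = [], 0.0
    for term, u in ordered:                   # lowest idf first
        if math.sqrt(squared + u * u) > theta:
            break                             # next term would be significant
        squared += u * u
        chosen.append((term, u))
    return chosen

P1 = [("a", 0.46), ("b", 0.14), ("c", 0.17), ("d", 0.62), ("e", 0.59)]
P2 = [("a", 0.95), ("b", 0.30)]
P3 = [("c", 0.14), ("e", 0.49), ("f", 0.17), ("g", 0.42),
      ("h", 0.11), ("i", 0.10), ("j", 0.72)]

for name, pairs, theta in (("P1", P1, 0.25), ("P2", P2, 0.20), ("P3", P3, 0.25)):
    sub = most_insignificant(pairs, theta)
    magnitude = math.sqrt(sum(u * u for _, u in sub))
    print(name, sub, round(magnitude, 4))
```

For the running example this selects (b, c) for P1, nothing for P2, and (i, h, c) for P3 with magnitude 0.2042, matching the subvector found in the text.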

With  this  knowledge,  we  can  indeed  index  the  profiles  selectively.  For  each  profile,  we  find  the 
most  insignificant  subvector  at  the  threshold  specified.  The  profile  is  then  posted  in  the  inverted  lists 
of  the  significant  (relative  to  the  most  insignificant  subvector)  terms.  In  each  posting,  we  include 
the  insignificant  terms  and  their  weights;  i.e.,  they  are  duplicated  in  the  lists  of  all  the  significant 
terms. This is called the Selective Profile Indexing (SPI) method.

Each  posting  contains  the  profile  identifier,  the  weight  of  the  term  indexed,  the  number  of 
insignificant  pairs,  and  the  pairs  of  insignificant  terms  and  weights.  Postings  in  the  same  list  are 
stored  sequentially  in  blocks. 

We also require the THRESHOLD and SCORE arrays as in the PI method. When a document
comes along, we construct its vector representation. Next we initialize the SCORE array to all 0's.
Then we use the directory to retrieve the inverted list of each document term. Suppose we are
processing the term x with weight w in the document. For each profile P in the x list, suppose the
weight of x in P is u, and the insignificant pairs are $(y_{i_1}, u_{i_1}), \ldots, (y_{i_k}, u_{i_k})$. We examine P's
SCORE entry. There are two cases. If the SCORE entry is zero, we first add the product w × u. Then
we look up each term $y_{i_j}$ in the document vector. Suppose its weight in the document is $w_{i_j}$. We
add the product $w_{i_j} \times u_{i_j}$ to the SCORE entry. In the second case, the SCORE entry is not zero,
meaning that we have already added the contribution of the insignificant terms in some earlier
computation. Thus we only add the product w × u. After all document terms have been processed, a
profile matches the document if its SCORE entry is greater than the THRESHOLD entry.

Figure 3 shows the index for our running example. For instance, suppose we are processing the
first pair (b, 0.15) from the document vector. The list of b has only one posting, that of $P_2$. We add
the product 0.15 × 0.30 = 0.045 to $P_2$'s SCORE entry. As there is no insignificant subvector, we
are done with this posting and also with the b list. Next we process the pair (d, 0.32). Only $P_1$'s
posting is in the d list. First we add the product 0.32 × 0.62 = 0.1984 to SCORE[$P_1$]. Then we
process the insignificant subvector ((b, 0.14), (c, 0.17)). To do this, we look up the term b in the
document vector, getting a weight of 0.15. Thus we increment SCORE[$P_1$] by the product 0.15 ×
0.14 = 0.021. Next, we look up c, which is not in the document vector. We are now done with this
list. The other pairs are processed similarly. The final values for SCORE are as shown in the figure.
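The SPI method and the walk-through above can be sketched as follows. As with the other sketches, this is an illustration under assumptions rather than the evaluated implementation: the index lives in an in-memory dictionary instead of on disk, term weights stand in for the idf ordering, a subvector counts as insignificant when its magnitude does not exceed the threshold, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def split_profile(pairs, theta):
    # Return (significant, insignificant) pairs; weights stand in for idfs.
    ordered = sorted(pairs, key=lambda pair: pair[1])
    squared, k = 0.0, 0
    for _, u in ordered:
        if math.sqrt(squared + u * u) > theta:
            break
        squared += u * u
        k += 1
    return ordered[k:], ordered[:k]

def build_spi_index(profiles):
    """profiles: {profile_id: ([(term, weight), ...], threshold)}"""
    index = defaultdict(list)
    threshold = {}
    for pid, (pairs, theta) in profiles.items():
        threshold[pid] = theta
        significant, insignificant = split_profile(pairs, theta)
        for x, u in significant:
            # The insignificant pairs are duplicated in every posting.
            index[x].append((pid, u, insignificant))
    return index, threshold

def match_spi(document, index, threshold):
    weights = dict(document)                   # document vector for lookups
    score = dict.fromkeys(threshold, 0.0)      # SCORE initialized to all 0's
    for x, w in document:
        for pid, u, insignificant in index.get(x, ()):
            first_visit = score[pid] == 0.0
            score[pid] += w * u
            if first_visit:                    # add insignificant terms once
                for y, u_y in insignificant:
                    score[pid] += weights.get(y, 0.0) * u_y
    matches = [pid for pid in threshold if score[pid] > threshold[pid]]
    return score, matches

profiles = {
    "P1": ([("a", 0.46), ("b", 0.14), ("c", 0.17), ("d", 0.62), ("e", 0.59)], 0.25),
    "P2": ([("a", 0.95), ("b", 0.30)], 0.20),
    "P3": ([("c", 0.14), ("e", 0.49), ("f", 0.17), ("g", 0.42),
            ("h", 0.11), ("i", 0.10), ("j", 0.72)], 0.25),
}
D = [("b", 0.15), ("d", 0.32), ("f", 0.21), ("h", 0.14), ("j", 0.90)]

index, threshold = build_spi_index(profiles)
score, matches = match_spi(D, index, threshold)
print({pid: round(s, 4) for pid, s in score.items()}, matches)
```

On the running example the b list holds only the posting of P2, there is no h list at all, and the final scores (about 0.2194, 0.045, and 0.6991) are the same as those produced by the PI method, so again only P3 matches. The zero test on the SCORE entry mirrors the description in the text; it relies on all weights being positive, and a separate set of already-visited profiles would be a more defensive alternative.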


4  Performance  Evaluation 

4.1  Models 

We use analysis and simulations to evaluate the performance of the methods. To allow flexibility
in  our  performance  evaluation,  we  use  synthetic  document  and  profile  models.  To  make  them 
realistic,  we  base  our  models  on  properties  of  a  database  of  Netnews  (text)  articles  received  by  our 
Department’s  Netnews  host  during  the  period  of  April  22  to  April  29,  1993.  A  total  of  212,972 
articles  were  collected,  making  up  a  550MB  database.  Below  we  describe  our  models. 

4.1.1  Document  Model 

The following steps were carried out to study the occurrence frequency of terms in the database.
First, a lexical analysis screened out all non-alphabetical characters from the documents (i.e.,
articles). Then a stemming routine (Porter's algorithm [10]) was run to reduce the remaining words to
word-stem form. Each stem thus obtained is a term. Next we measured the occurrence frequency
of each term in the database, obtaining the plot shown in Figure 4 (note the log/log scale). The
x-intercept (i.e., the size of the term vocabulary, which we denote by v) is found to be 521,915. The
straight line in the graph was derived by curve fitting using [17]. We can see the database does
demonstrate Zipfian characteristics [19]. Also, the average number of words per document (denoted
by d) is found to be 323.

Figure 4: Term Rank vs Term Frequency Graph for Netnews Database

Hence, we come up with the following probabilistic document model. The terms in a document
come from a vocabulary V of size v. Each term is uniquely represented by an integer x, 1 ≤ x ≤ v.
The probability that any term appears is described by the probability distribution Z. We rank the
terms in non-increasing order of frequencies, i.e., for all x, y with 1 ≤ x < y ≤ v, we have
Z(x) ≥ Z(y); for convenience, we use the rank to identify the terms. We assume the frequency
distribution follows Zipf's Law; i.e.,

$$Z(x) = \frac{1/x}{\sum_{y=1}^{v} 1/y}.$$

A  document  has  d  term  occurrences  and  is  generated  by  a  sequence  of  d  independent  and  identically 
distributed  trials;  each  trial  produces  one  term  from  V  according  to  the  distribution  Z.  The  most 
frequent  s  terms  form  the  stop  list;  stop-listed  terms  are  deleted  from  a  document  before  its  vector 
representation  is  computed.  We  choose  s  to  be  100  in  the  evaluation. 
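A minimal sketch of this document generator is given below, assuming the classical form of Zipf's Law in which Z(x) is proportional to 1/x. The function names `make_zipf_sampler` and `generate_document` and the use of Python's `random.choices` are choices made for this illustration, not details taken from the paper.

```python
import random
from itertools import accumulate


def make_zipf_sampler(v, rng):
    """Return draw(k), which samples k term ranks from 1..v with Z(x) ~ 1/x."""
    cumulative = list(accumulate(1.0 / x for x in range(1, v + 1)))
    ranks = range(1, v + 1)

    def draw(k):
        return rng.choices(ranks, cum_weights=cumulative, k=k)

    return draw


def generate_document(draw, d, s):
    """Run d independent trials and drop stop-listed ranks (1..s).

    Returns a dict mapping each surviving term rank to its frequency.
    """
    freq = {}
    for x in draw(d):
        if x > s:
            freq[x] = freq.get(x, 0) + 1
    return freq


rng = random.Random(0)
draw = make_zipf_sampler(v=5000, rng=rng)   # small vocabulary for a quick demo
doc = generate_document(draw, d=323, s=100)
```

The demo uses a reduced vocabulary so that it runs instantly; passing v = 521915 gives the base-case setting.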

Finally, the vector representations of the documents are computed as described in Section 2. The
exact formulas used to compute the weight of a term $x_i$ are from [13], which have been empirically
found to be effective:

$$tf_i = 0.5 + 0.5 \times \frac{f_i}{\max_j f_j} \quad \text{and} \quad idf_i = \log\left(\frac{1}{\text{fraction of documents with } x_i}\right),$$

where $f_i$ is the frequency of the term $x_i$ in the document. We analytically compute the fraction in
idf as the probability that $x_i$ appears in a document.
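The weighting step can be sketched as follows. Two points are assumptions of this sketch rather than statements from this section: the term weight is taken to be the product tf x idf, and the vector is scaled to unit Euclidean length, as is usual under the vector space model. The base of the logarithm is not stated in the text; after normalization it makes no difference. The names `term_weights` and `zipf_prob` are hypothetical.

```python
import math


def term_weights(freq, zipf_prob, d):
    """Compute normalized tf x idf weights for one document.

    freq:      term rank -> frequency f_i in this document
    zipf_prob: function giving Z(x) for a term rank x
    d:         number of term occurrences per document
    """
    max_f = max(freq.values())
    raw = {}
    for x, f in freq.items():
        tf = 0.5 + 0.5 * f / max_f
        # Probability that x appears at least once in a document of d trials.
        fraction = 1.0 - (1.0 - zipf_prob(x)) ** d
        idf = math.log(1.0 / fraction)
        raw[x] = tf * idf
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {x: w / norm for x, w in raw.items()}


H = sum(1.0 / k for k in range(1, 1001))          # demo vocabulary of 1000 terms
weights = term_weights({101: 2, 500: 1}, lambda x: 1.0 / (x * H), d=323)
```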
4.1.2  Profile  Model


Looking at our database, we find that a large fraction of the terms in the vocabulary occur very
infrequently. Those terms are mostly from misspellings, typos, or self-invented words. We do not
expect these terms to appear in profiles, which represent long-term interests. We model this by
assuming that profile terms are chosen from the set Q = {s+1, ..., q}, called the queried vocabulary,
out of the vocabulary V = {1, ..., v}, q ≤ v. (Recall that we are identifying terms by their ranks.) A
base value of 50000 is chosen for q, covering more than 97% of the total occurrences of terms in the
Netnews database.

We assume that each term in Q is equally likely to be chosen for a profile. This uniform
distribution is justified as queries tend to use a mix of frequent and relatively infrequent words [16].
Also, terms rarely occur more than once in a profile [12]; thus we assume that a profile is a set of p
terms chosen randomly without replacement from the queried vocabulary Q.

The number of profiles in the system is n. To simplify the study of the effect of profile size on
performance, we assume all profiles have the same length, i.e., p is fixed for all profiles.
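The profile model translates almost directly into code. The sketch below is illustrative only; the function name `generate_profiles` is hypothetical, and a reduced n is used in the demo call.

```python
import random


def generate_profiles(n, p, s, q, rng):
    """Generate n profiles, each a list of p distinct term ranks.

    Terms are drawn uniformly, without replacement, from the queried
    vocabulary Q = {s+1, ..., q}.
    """
    queried = range(s + 1, q + 1)
    return [rng.sample(queried, p) for _ in range(n)]


profiles = generate_profiles(n=1000, p=5, s=100, q=50_000, rng=random.Random(42))
```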

Some of these assumptions may not be valid when relevance feedback is used. We therefore modify
our profile model in the evaluation of the methods under relevance feedback (Section 5.7).

4.1.3  Choice  of  Relevance  Threshold 

It is hard to model the relevance threshold distribution. For a user, a suitable relevance threshold
for his profile depends on the individual profile terms (their idf's), the degree of correlation among
the terms, the amount of relevant, as well as irrelevant, information in the incoming stream, and his
desired level of precision and recall (is it crucial to receive all possibly relevant documents, or is it
more desirable to receive those that are likely to be relevant?).

Instead of deriving a complicated model of relevance threshold, we assume the relevance threshold
is fixed for all profiles. This allows us to study clearly its impact on the methods. A reasonable base
case value was found by the following procedure. First a random document was generated. Then a
profile was created to contain a number of overlapping terms, randomly selected from the document.
The similarity between the document and the profile was computed. The procedure was repeated a
large number of times. For a base case profile length of 5, we found that a profile with 4 or more
matching terms has an average similarity of about 0.2. Thus we use this as the base value of the
relevance threshold for our evaluation. Of course, this is not saying that the relevance threshold
simply translates to the number of matching terms. We are merely settling on a reasonable starting
point in our evaluation. In Section 5.5, we vary the threshold over the entire range of possible values
from 0 to 1 and examine its effect on the performance.


Parameter   Base Value   Description
v           521915       size of vocabulary
d           323          term occurrences per document
s           100          end of stop list
q           50000        end of queried vocabulary
n           300000       # profiles
p           5            terms per profile
θ           0.2          relevance threshold
i           4            # bytes for profile identifier
l           2            # bytes to represent length of profile
t           4            # bytes to represent a term
f           4            # bytes to represent a floating point number
b           512          # bytes in a disk block

Table  1:  Summary  of  Parameters  Used  in  Performance  Evaluation


Table  1  summarizes  the  parameters  used  in  the  models,  together  with  some  parameters  that 
specify  the  sizes  of  various  fields  in  the  data  structures,  and  the  disk  block  size.  Keep  in  mind  that 
the  base  values  shown  are  simply  starting  points  for  our  evaluation.  We  explore  different  sets  of 
values  in  our  experiments  —  Section  5  shows  some  of  the  results. 

4.2  Metrics 

We compare the methods with respect to their space and time requirements. For space requirement,
we look at how much disk space each structure takes. (Although main memory space requirements
of the methods differ, we assume they fit in main memory.) We study two ways of storing the
inverted lists in the indexing methods: the first is to pack all lists contiguously into sequential
blocks, leaving no disk space in between lists; the second way is to store each list in an integral
number of blocks, allowing easy list expansions. By comparing the space requirement for these two
options, we can see the amount of internal fragmentation the second option produces.

For time requirement, in an I/O bound system, the critical measure is the number of I/Os to
process a document; in a CPU bound system (including the case when a large portion of the data
structures can be cached in main memory), the amount of computation is the critical component.
Hence, we look at both aspects in our comparison. For the CPU computation, we count the number
of floating-point multiplications each method requires to process a document. The number of
multiplications is one of the major computation costs in processing a document, so we believe it is
a good measure of CPU cost.
In summary, we look at these metrics:


•  the expected total disk space required in number of blocks (with contiguous allocation and
fragmented allocation for indexing methods),

•  the expected number of disk reads needed to match a document, and

•  the expected number of floating point multiplications performed to process a document.


4.3  Analysis  and  Simulations 

Except those for the SPI method, the results in Section 5 are obtained by deriving analytical
solutions and then numerically evaluating the expressions. This subsection contains the details of
the analysis.


4.4  Brute  Force  (BF)  Method 

The space requirement for the BF method is simply the number of profiles times the size of each
record; and as all profiles are read to process a document, the number of blocks read per document
is the same:

$$T_{BF} = R_{BF} = \left\lceil \frac{n\,(i + l + f + p(t + f))}{b} \right\rceil$$
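As a quick numerical check, the BF expression can be evaluated with the base values of Table 1. The reading of the formula used here, ceil(n(i + l + f + p(t + f)) / b), is reconstructed from a damaged equation and should be treated as an assumption; it does reproduce the 29297 blocks reported for BF in Table 2. The function name `bf_blocks` is hypothetical.

```python
import math


def bf_blocks(n, p, i, l, t, f, b):
    """Blocks needed to store n profile records under the BF method.

    Each record is assumed to hold a profile identifier (i bytes), a length
    field (l bytes), a threshold (f bytes) and p (term, weight) pairs.
    """
    record_bytes = i + l + f + p * (t + f)
    return math.ceil(n * record_bytes / b)


blocks = bf_blocks(n=300_000, p=5, i=4, l=2, t=4, f=4, b=512)
print(blocks)  # 29297
```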


Next, we derive a useful expression for later analysis: the number of distinct terms in a document
D that fall in the queried vocabulary. This can be derived as follows. For any term x in the queried
vocabulary, the probability that a term in D is x is equal to Z(x). So the probability that it is not
x is 1 - Z(x). The probability that x does not appear in D is $(1 - Z(x))^d$. Finally, the probability
that x does appear in D is $1 - (1 - Z(x))^d$.

The expected number of distinct terms in D that are in the queried vocabulary, which we denote
by $\bar{d}$, is

$$\bar{d} = \sum_{x=s+1}^{q} \Pr(x \text{ is in } D) = \sum_{x=s+1}^{q} \left(1 - (1 - Z(x))^d\right).$$


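The total number of terms examined per document is np. A fraction $\bar{d}/(q-s)$ of them is expected
to occur in the document, where $\bar{d}$ is the expected number of distinct document terms that fall in
the queried vocabulary. Thus, the expected number of multiplications performed is:

$$M_{BF} = np \times \frac{\bar{d}}{q-s}$$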
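The two expressions above can be evaluated numerically. The sketch below assumes the classical Zipf form Z(x) = (1/x) / H_v, where H_v is the v-th harmonic number; the function name `expected_distinct_terms` is hypothetical. Under that assumption the result lands within about one or two percent of the 4314 multiplications reported for BF in Table 2, which suggests the assumed distribution is at least close to the one used in the evaluation.

```python
def expected_distinct_terms(v, d, s, q):
    """Expected number of distinct queried-vocabulary terms in a document.

    Sums 1 - (1 - Z(x))**d over x = s+1 .. q, with Z(x) = (1/x) / H_v
    (an assumed, classical form of Zipf's Law).
    """
    h_v = sum(1.0 / x for x in range(1, v + 1))
    total = 0.0
    for x in range(s + 1, q + 1):
        z = 1.0 / (x * h_v)
        total += 1.0 - (1.0 - z) ** d
    return total


n, p, s, q = 300_000, 5, 100, 50_000
d_bar = expected_distinct_terms(v=521_915, d=323, s=s, q=q)
m_bf = n * p * d_bar / (q - s)
```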
4.5  Profile  Indexing  (PI)  Method

Assuming the lists are packed contiguously, the total disk space required for the PI method is

$$T^{C}_{PI} = \left\lceil \frac{np\,(i + f)}{b} \right\rceil.$$
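Evaluating this expression with the base values of Table 1 gives the contiguous size listed for PI in Table 2. The formula ceil(np(i + f) / b) is a reconstruction of a damaged equation and is therefore an assumption of this sketch, as is the function name `pi_contiguous_blocks`.

```python
import math


def pi_contiguous_blocks(n, p, i, f, b):
    """Blocks needed when all np postings of (i + f) bytes are packed."""
    return math.ceil(n * p * (i + f) / b)


pi_blocks = pi_contiguous_blocks(n=300_000, p=5, i=4, f=4, b=512)
print(pi_blocks)  # 23438
```

The saving relative to BF comes from the postings omitting the per-profile length and threshold fields.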

Now if we assume that lists are not packed, we have to calculate the length of each list. We
consider the question: given $\mathcal{N}$ postings, each of size $\mathcal{R}$, that are to be placed in a number of
lists, what is the expected number of blocks in a certain list, if the block size is $\mathcal{B}$ and the
probability that a posting falls in this list is $\mathcal{P}$? Let us denote this expression by
$\mathcal{C}(\mathcal{N}, \mathcal{P}, \mathcal{R}, \mathcal{B})$.

Intuitively, we can compute the expected number of postings in the list as $\mathcal{N}\mathcal{P}$ and compute the
expected number of blocks as

$$\frac{\mathcal{N}\mathcal{P}\mathcal{R}}{\mathcal{B}}.$$

However, this is incorrect as it neglects the internal fragmentation that results when the postings
do not fully occupy an integral number of blocks. The formula

$$\left\lceil \frac{\mathcal{N}\mathcal{P}\mathcal{R}}{\mathcal{B}} \right\rceil$$

is incorrect also, as it always overestimates the number of blocks required. (For example, if $\mathcal{P}$ is
very small, the expected number of blocks should be small (less than 1), yet the formula gives
1 no matter how small $\mathcal{P}$ is.)

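Let us now derive a correct expression for the value. Let random variable H be the number of
postings in the list. H follows the binomial distribution $\mathrm{Bin}[\mathcal{N}, \mathcal{P}]$. Let random variable J be the
number of blocks in the list. H and J are related by

$$J = \left\lceil \frac{H\mathcal{R}}{\mathcal{B}} \right\rceil.$$

We want to find E[J]. First we compute the following probability:

$$\Pr\{J = j\} = \Pr\left\{\left\lceil \frac{H\mathcal{R}}{\mathcal{B}} \right\rceil = j\right\} = \Pr\left\{j - 1 < \frac{H\mathcal{R}}{\mathcal{B}} \le j\right\} = \sum_{(j-1)\mathcal{B}/\mathcal{R} \,<\, h \,\le\, j\mathcal{B}/\mathcal{R}} \mathrm{Bin}[h; \mathcal{N}, \mathcal{P}].$$

To efficiently evaluate the last sum, we use the normal approximation when appropriate, and the
Poisson approximation when that is not applicable. Finally, the expression that we are after is thus

$$\mathcal{C}(\mathcal{N}, \mathcal{P}, \mathcal{R}, \mathcal{B}) = E[J] = \sum_{j>0} j \Pr\{J = j\} = \sum_{j>0} j \sum_{(j-1)\mathcal{B}/\mathcal{R} \,<\, h \,\le\, j\mathcal{B}/\mathcal{R}} \mathrm{Bin}[h; \mathcal{N}, \mathcal{P}].$$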
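The expectation E[J] can also be computed by summing the binomial probabilities directly, which is what the sketch below does. This is a deliberately different technique from the normal and Poisson approximations mentioned in the text: it is exact up to floating-point error but only practical when Pr{H = 0} does not underflow, which holds for the parameter ranges used in this paper. The function name `expected_blocks` is hypothetical.

```python
import math


def expected_blocks(N, P, R, B):
    """E[ceil(H * R / B)] for H ~ Bin(N, P) with 0 < P < 1, by direct summation."""
    pmf = math.exp(N * math.log1p(-P))   # Pr{H = 0}
    ratio = P / (1.0 - P)
    expected = 0.0
    mass = 0.0
    for h in range(N + 1):
        blocks = (h * R + B - 1) // B     # integer ceiling of h*R/B
        expected += blocks * pmf
        mass += pmf
        if mass > 1.0 - 1e-12:            # remaining tail is negligible
            break
        # Recurrence: Pr{H = h+1} = Pr{H = h} * (N-h)/(h+1) * P/(1-P)
        pmf *= (N - h) / (h + 1) * ratio
    return expected


tiny = expected_blocks(10, 1e-6, 8, 512)
base = expected_blocks(300_000, 5 / 49_900, 8, 512)
```

For a very small probability the result is far below 1, unlike the ceiling formula criticised above. With n = 300000 candidate postings of 8 bytes, a per-profile probability of p/(q - s) and 512-byte blocks (the per-list probability is itself a reconstructed value), each list is expected to occupy almost exactly one block, consistent with the 49900 fragmented blocks for PI in Table 2.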

Now we proceed with the analysis of the PI method. For a particular list, the maximum number
of postings that can be placed in it is n. (Although the total number of postings in the index
structure is np, at most n of them can be on the same list.) The probability that a given profile
contributes a posting to the list is $p/(q-s)$. The profile identifier and term weight are kept in a
posting, so the posting size is $i + f$. The expected number of blocks in each list is thus
$\mathcal{C}(n, \frac{p}{q-s}, i + f, b)$. The expected total size is then

$$T^{S}_{PI} = \mathcal{C}\left(n, \frac{p}{q-s}, i + f, b\right) \times (q - s).$$

The expected number of lists read is $\bar{d}$, the expected number of distinct document terms in the
queried vocabulary, so the expected number of blocks read per document is

$$R_{PI} = \mathcal{C}\left(n, \frac{p}{q-s}, i + f, b\right) \times \bar{d}.$$

The number of multiplications is the same as that of the BF method: any multiplication that
must be done in the BF method must still be done in the PI method. Thus, we have

$$M_{PI} = np \times \frac{\bar{d}}{q-s}.$$

4.6  Simulations 

Simulations were conducted to obtain the results for the SPI method. We also constructed
simulations to validate the analysis. The simulation results did match the analytical ones.

We wrote our simulation program in C. The program first generates n profiles according to
the profile model, and then computes the size of the index structures needed to store the profiles.
Next the simulation program generates a document according to the document model and counts the
number of disk reads and multiplications needed to match it against the n profiles. For each scenario
we have tested, the program is run enough times (with different random number generator seeds)
to make sure that the results are within ±5% of the true values, with a 90% level of confidence.
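One common way to implement such a stopping rule is sketched below. The paper does not say how the confidence interval was computed, so the normal-approximation interval (z = 1.645 for 90%), the minimum run count, and the names `run_until_confident` and `run_once` are all assumptions of this illustration.

```python
import math
import statistics


def run_until_confident(run_once, rel_error=0.05, z=1.645,
                        min_runs=5, max_runs=10_000):
    """Repeat run_once(seed) until the CI half-width is within rel_error of the mean.

    Returns (mean, number_of_runs). Uses a normal-approximation interval.
    """
    samples = []
    for seed in range(max_runs):
        samples.append(run_once(seed))
        if len(samples) >= min_runs:
            mean = statistics.fmean(samples)
            half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
            if mean != 0 and half_width <= rel_error * abs(mean):
                return mean, len(samples)
    raise RuntimeError("confidence target not reached within max_runs")


# Stand-in for one simulation run; a real run would count disk reads
# or multiplications for one generated document.
mean, runs = run_until_confident(lambda seed: 100 + (seed % 3))
```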
5  Results


5.1  Base  Case  Results 

The results for the base case are given in Table 2. In the case when the inverted lists of the
indexing methods are packed contiguously, the total space requirements for the three methods are
roughly comparable. PI is better than the BF method, since the threshold values are stored in
main memory. The SPI method requires more space than PI, because some (term, weight) pairs are
duplicated in a number of lists in the index.

When the inverted lists are not packed, but are stored individually in an integral number of
blocks, internal fragmentation leads to an increase in total space requirement of about 68% for SPI
to 113% for PI. The split-list strategy allows for easier updates, but we have to pay the price of
higher total space requirement.

For  the  number  of  disk  reads  performed  per  document,  we  see  orders  of  magnitude  improvement 
of  the  indexing  methods  over  the  BF  method.  The  SPI  method  is  best,  due  to  the  fact  that  certain 
frequent  terms  in  a  profile  are  not  indexed.  For  this  same  reason,  the  number  of  multiplications  for 
SPI  is  lower  than  that  for  BF  and  PI  (the  latter  two  perform  the  same  number  of  multiplications; 
see  the  analysis). 


Method   Contiguous Size (Blocks)   Fragmented Size (Blocks)   Disk Reads   Multiplications
BF       29297                      -                          29297        4314
PI       23438                      49900                      144          4314
SPI      29630                      49804                      127          3434

Table  2:  Results  for  the  Base  Case


In  what  follows,  we  describe  several  sensitivity  studies  in  which  we  vary  the  parameter  values. 

5.2  Size  of  Queried  Vocabulary 

The  first  parameter  that  we  exercise  is  q,  which  controls  the  size  of  the  queried  vocabulary.  Figures 
5  to  7  show  the  results. 

In Figure 5, the total space requirement for the BF method, as well as the indexing methods
when the contiguous-list strategy is used, is insensitive to q. However, when the split-list strategy
is used for the indexing methods, their space requirement does vary with q. The fluctuations in the
graph for SPI can be explained as follows. When q is 20000, each inverted list occupies 2 blocks.
As q increases, the number of lists increases, and so the total size increases. At the same time, the
number of postings in a list decreases, since they are distributed over a larger number of lists. At
some point (around q = 30000), the lists begin to shrink in size to 1 block, and this explains the
drop in total size. Thereafter, the total space requirement increases linearly with q, as each list fits
in 1 block. The same reasoning can be applied to the fluctuations in the graph of PI.

Figure 5: Total Size vs. q

Figure  6  shows  the  results  for  the  number  of  blocks  read  per  document.  The  number  of  blocks 
read  for  the  BF  method  is  constantly  equal  to  its  total  space  requirement,  and  thus  the  graph  is 
omitted  to  show  the  variations  in  the  other  methods  better.  The  sharp  drop  in  the  number  of  I/Os 
required  corresponds  to  the  shrinking  of  the  list  length  (from  2  blocks  to  1  block).  Thereafter,  the 
number  of  I/Os  increases,  as  the  number  of  lists  read  per  document  increases  (due  to  the  increase 
in  the  queried  vocabulary  size).  The  rise  is  more  prominent  in  PI  than  in  SPI. 

For  the  number  of  multiplications  per  document  (Figure  7),  SPI  is  better  throughout  than  the 
other  methods.  The  trend  is  downward  for  all  methods,  as  more  infrequent  terms  appear  in  profiles. 


5.3  Profile  Length 

The  next  parameter  that  we  vary  is  the  profile  length.  Figures  8  to  10  show  the  results. 

For  contiguous  allocation,  we  see  the  total  space  requirement  grows  with  p  for  all  methods  (Figure 
8). For fragmented allocation, with a small p, the inverted lists each fit in one block, so the size
remains  constant  at  the  queried  vocabulary  size.  With  larger  p,  the  lists  grow  in  length,  so  the  total 
space  requirement  grows  also.  The  SPI  method  grows  at  a  faster  rate  than  the  PI  method. 

The number of disk reads required by the SPI method initially decreases as p is increased from
1 (Figure 9). This is because it becomes more likely that a profile includes infrequent terms and
is thus indexed by those terms. With the longer lists at larger p (greater than 7), its performance
deteriorates and then stabilizes. On the other hand, for the number of multiplications, SPI is always
better than the two other methods (Figure 10).

Figure 6: Number of Blocks Read Per Document vs. q

Figure 7: Number of Multiplications Per Document vs. q

Figure 10: Number of Multiplications Per Document vs. Profile Length p

5.4  Number  of  Profiles 

We vary the number of profiles from 100000 to 800000. For the total space requirement (results
shown in Figure 11), we have a graph similar to that for p. For contiguous allocation, the space
requirement grows linearly with n. For fragmented allocation, the space required is at first constant
and then increases. Each inverted list fits in 1 block at the beginning, but as n increases, 2 blocks
are needed to hold a list. The lists grow at a faster rate in the SPI method initially, but PI soon
catches up with it.

Figure  12  shows  the  results  for  the  number  of  disk  I/Os  required  per  document.  Those  for  the 
BF  method  are  omitted.  We  see  there  is  a  range  of  n  values  where  SPI  requires  more  I/Os  per 
document;  this  happens  when  an  SPI  inverted  list  grows  faster  than  a  PI  list.  When  the  list  length 
becomes the same in both methods, SPI again becomes better than PI.

In terms of number of multiplications per document, all methods scale proportionally to the
number of profiles, with the SPI method always better than the other two methods. Due to space
considerations,  we  omit  the  graphs  here. 
Figure 12: Number of Blocks Read Per Document vs. Number of Profiles n


Figure 13: Number of Multiplications Per Document vs. n


5.5  Relevance  Threshold

The next parameter that we vary is the relevance threshold. Although it may not make sense to
have a threshold value of 0 or 1, we study the entire range of possible values to confirm our intuition
about the SPI method. The other methods are insensitive to the relevance threshold.

With θ increasing, we expect a more substantial portion of a profile to be insignificant and be
duplicated in the lists of significant terms in SPI. Thus the total index size increases, but as θ
increases further, the insignificant portion is posted in fewer lists (the number of significant terms
decreases). Thus, a certain maximum would be reached somewhere in the range. This is indeed the
case for our results shown in Figure 13.

Although the total size increases and then decreases with increasing θ, the number of I/Os is
always decreasing (Figure 14), because profiles are indexed in fewer lists of lower frequency terms.
Similarly, the number of multiplications decreases also (Figure 15).

The relative performance of SPI against the other two does not vary much with different values
of θ. For the space requirement, it almost always requires more space than the other two, except
when θ is close to 1. For the time requirement, it is always no worse than the other methods.


5.6  Document  Size 

The  size  of  documents  only  affects  the  two  time  requirement  metrics.  The  performance  of  the 
methods with respect to both metrics scales proportionally to the document size, with no change in
relative  performance. 
Figure 14: Total Size vs. Relevance Threshold θ


Figure 15: Number of Blocks Read Per Document vs. Relevance Threshold θ


Figure 18: Number of Multiplications Per Document vs. Document Size d


5.7  Relevance  Feedback

We  perform  simulations  to  evaluate  the  methods  when  relevance  feedback  is  used.  First  we  describe 
the  setting  of  our  parameters  to  model  the  effects  of  relevance  feedback. 

As  relevant  document  vectors  are  added  to  the  profile  vector,  new  terms  are  introduced  to 
the  profile  vector.  Potentially,  the  number  of  terms  in  a  profile  becomes  arbitrarily  large.  This 
is  expensive  in  terms  of  both  profile  storage  and  document  processing  time.  As  shown  in  [14],  a 
compromise  is  to  expand  the  profile  vector  up  to  a  certain  maximum  number  of  terms.  Terms  with 
low  weights  are  discarded.  This  may  result  in  a  slight  drop  in  retrieval  effectiveness,  but  is  important 
in  keeping  down  the  storage  and  processing  costs  [14].  Thus,  in  our  simulations,  we  assume  that  the 
length  of  a  profile  (p)  is  fixed  at  40. 

Another effect from relevance feedback is that, as relevant document vectors are added to and
irrelevant document vectors are subtracted from a profile vector, the “interesting” terms in the profile
vector will accumulate high weights, while the other, less relevant terms will have lower weights.
To illustrate, consider a user who subscribes a profile on, say, “information filtering.” After receiving
and reviewing filtered documents, he modifies his profile by relevance feedback. The modified profile
is expected to have high weights for the words “information” and “filtering,” as well as related words
on the same topic, such as “selective,” “dissemination,” “alert,” and so on. Other words in the profile
are somewhat related, but not as important, for example “retrieval” or “document.”

Using the feedback formula (1) in Section 2.4, the modified weight of a term $x_i$ (before
normalization) is

$$idf_i \times \Big( \sum_{D_j \text{ relevant}} tf_{j,i} \; - \sum_{D_k \text{ irrelevant}} tf_{k,i} \Big) \qquad (2)$$

where $tf_{j,i}$ (or $tf_{k,i}$) is the term frequency factor of term $x_i$ in document $D_j$ (or $D_k$). Let us call
the expression inside the parentheses in (2) the cumulative term frequency (ctf) of the term $x_i$.

Figure 19: Total Size vs. Extra Weight Factor r

To keep the simulations simple, we make the assumption (based on the discussion above) that the
terms in a modified vector fall into two categories: interesting and non-interesting; in each category,
the ctf's of the terms are roughly equal. In other words, we assume the non-interesting terms all
have a ctf of, say, a, and the interesting terms have a ctf of, say, ra (i.e., they are r times larger).

To form a profile in our simulations, we fix the number of “interesting” terms to 5. Then we
randomly select p = 40 terms from the queried vocabulary Q. Out of these terms, we randomly
select five of them to be the “interesting” terms. The non-interesting terms are given weights equal
to their idf's, and the interesting terms are given weights r times their idf's. Then the vector is
normalized. (We do not need to pick a value for a, as it would be normalized out anyway.) We
vary the extra weight factor (r) from 1 to 30 in the simulations.
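The profile construction just described can be sketched as follows. Normalization to unit Euclidean length is an assumption of this sketch (the text only says the vector is normalized), and the function name `feedback_profile` and the stand-in `idf` function in the demo call are hypothetical.

```python
import math
import random


def feedback_profile(idf, s, q, r, rng, p=40, interesting=5):
    """Build one relevance-feedback profile.

    idf: function giving the idf of a term rank
    r:   extra weight factor applied to the "interesting" terms
    Returns (term -> normalized weight, set of interesting terms).
    """
    terms = rng.sample(range(s + 1, q + 1), p)
    boosted = set(rng.sample(terms, interesting))
    raw = {x: idf(x) * (r if x in boosted else 1.0) for x in terms}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {x: w / norm for x, w in raw.items()}, boosted


profile, boosted = feedback_profile(
    idf=lambda x: math.log(x),   # stand-in idf, for demonstration only
    s=100, q=50_000, r=10, rng=random.Random(1),
)
```

Because the vector is normalized at the end, the common factor a in the cumulative term frequencies never needs to be chosen, as noted in the text.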

The results of the simulations are shown in Figures 19-21. We observe that with a large profile
size, the SPI Method takes up a lot more space than the BF and PI Methods. This is because of
the replication of the insignificant terms in the lists for the significant terms. This also leads to
more I/Os per document matched. On the other hand, in terms of the number of multiplications
per document, the SPI Method is a lot better than the other two methods.

Figure 20: Number of Blocks Read Per Document vs. Extra Weight Factor r

Figure 21: Number of Multiplications Per Document vs. Extra Weight Factor r

Only  the  SPI  Method  is  sensitive  to  the  extra  weight  factor  r.  In  Figure  19,  we  see  that  the  total 
size  of  the  SPI  index  increases  with  r.  As  more  and  more  weight  is  given  to  the  “interesting”  terms, 
they  become  more  and  more  significant.  Finally,  when  r  is  about  10,  the  only  significant  terms  are 
the  “interesting”  terms.  Thus,  the  size  of  index  becomes  constant.  The  same  reasoning  explains  the 
shape  of  the  SPI  graphs  in  Figures  20  and  21. 

6  Related  Work 

References  [2,  5,  9]  investigate  the  effectiveness  of  different  retrieval  models  applied  to  information 
filtering. 

In  [18],  we  study  what  index  structures  can  be  used  to  speed  up  information  filtering  under  the 
boolean  model.  The  PI  and  SPI  methods  presented  in  this  paper  can  be  seen  as  generalizations  of 
the  Counting  and  Key  methods  in  [18]. 

Terry  et  al.  [15]  propose  the  notion  of  continuous  queries  in  relational  databases.  Users  issue 
continuous  queries,  which  are  rewritten  into  incremental  queries  and  run  periodically.  Their  work 
concentrates on relational databases, while ours is concerned with the dissemination of unstructured
data  (documents)  using  information  retrieval  techniques. 

Related  to  the  idea  of  a  profile  index  is  that  of  the  “segment  tree”  presented  in  [3].  There,  Danzig 
et  al.  present  a  distributed  indexing  scheme  as  a  way  to  provide  efficient  retrospective  search  of  a 
large  number  of  retrieval  systems.  Special  sites,  called  index  brokers,  maintain  indexes  of  remote 
retrieval  systems.  They  subscribe  “generator  queries”  that  keep  them  informed  of  changes  in  these 
systems.  The  segment  tree  is  proposed  to  index  numerical  generator  queries  over  Library  of  Congress 
numbers (e.g., all new items in the range QA76 to QA77). Index structures for general profiles are
not  addressed. 


7  Conclusion 

In  this  paper,  we  study  what  data  structures  and  algorithms  can  be  used  to  facilitate  large-scale 
information  filtering  under  the  VSM.  We  apply  the  idea  of  the  standard  inverted  index  to  index 
user  profiles  (we  call  this  the  PI  method)  and  show  that  only  slight  modifications  are  needed  to  use 
the  index  to  speed  up  filtering.  We  devise  an  alternative,  called  the  SPI  method,  to  the  standard 
inverted  index  -  instead  of  indexing  every  term  in  a  profile,  we  select  only  the  significant  ones  to 
index.  We  evaluate  their  performance,  together  with  the  BF  method  which  uses  no  profile  index. 
In summary, we see that the three methods require approximately the same disk space when
inverted lists are packed into contiguous blocks. When lists are stored individually in an integral
number of blocks, the indexing methods require more disk space than the BF method. On the other
hand, when we compare the time requirement, the BF method is the clear loser. The indexing
methods require orders of magnitude fewer I/Os to match a document. Between the PI
and SPI methods, SPI is always better in terms of CPU processing. It can also improve the number
of I/Os required in many cases, depending mainly on the profile length and the number of profiles.

Although  in  those  cases  where  SPI  wins,  the  difference  may  appear  small,  we  should  remember 
that  the  results  shown  are  for  processing  a  single  document.  An  information  server  will  be  doing 
this  matching  day  in  and  day  out,  and  the  difference  will  be  magnified.  Another  observation  is  that 
as  SPI  is  always  the  best  in  CPU  processing,  when  main  memory  is  large  enough  to  hold  the  entire 
index, SPI is the clear choice. In that case, instead of duplicating insignificant terms in lists of
indexed  terms,  we  can  just  use  a  pointer  to  reference  the  insignificant  terms,  stored  separately. 